Introduction

We will be analyzing a data, which are label-free, bottom-up and generated by data-dependent acquisition. Raw spectra were searched using MaxQuant software, and we will start with the resulting proteinGroups.txt file, containing information about proteins identified and their quantification.

It is an interactomics study of ubiquitin interactors, based on articles of Zhang, et al. 2017 and Zhang et al., 2018.

For the analysis, we will be using DEP package (Differential Enrichment analysis of Proteomics data).

Libraries required

We will be using several libraries:

  • here (path definitions)
  • DEP (proteomics workflow)
  • dplyr (basic data operations)
  • SummarizedExperiment (normalization purposes)
  • DT (output table generation)

Install the required packages using install.packages() or BiocManager, don’t forget to comment the installation commands afterwards.

# install.packages("package") 
# OR
# library("BiocManager")
# BiocManager::install("package")

library(here)
library(DEP)
library(dplyr)
library(DT)
library(SummarizedExperiment)

Data input

On input, there is usually proteinGroups.txt file, generated by MaxQuant software. Read it into R using read.delim() or read.delim2() function.

data <- read.delim(here('data', 'proteinGroups.txt'))
#data <- UbiLength # alternatively, use this command to get the data
nrow(data)
## [1] 3006

Data preparation

Contaminants filtering

We will filter out the most common contaminants:

  • Reverse sequences
  • Only identified by site proteins
  • cRAP protein sequences (e.g common laboratory proteins, proteins added accidentally by dust/physical contact, etc.)
  • keratins

By change of nrow() you can see how many proteins were filtered out

data <- data %>%
  filter(Reverse != "+") %>%
  filter(!grepl("cRAP", Majority.protein.IDs)) %>%
  filter(Only.identified.by.site != "+") %>%
  filter(!grepl("keratin", Fasta.headers)) %>%
  filter(!grepl("Keratin", Fasta.headers)) %>%
  filter(!grepl("PE=5", Fasta.headers))

nrow(data)
## [1] 2910

Unique identifiers

Dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique. For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.

# Are there any duplicated gene names?
data$Gene.names %>% duplicated() %>% any()
## [1] TRUE
# Make a table of duplicated gene names
data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>% 
  arrange(desc(frequency)) %>% filter(frequency > 1)
## # A tibble: 46 x 2
##    Gene.names frequency
##    <chr>          <int>
##  1 ""                11
##  2 "ATXN2"            4
##  3 "SF1"              4
##  4 "ATXN2L"           3
##  5 "RBM33"            3
##  6 "UGP2"             3
##  7 "ACTL6A"           2
##  8 "BCLAF1"           2
##  9 "BRAP"             2
## 10 "CALU"             2
## # ... with 36 more rows
# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name.
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")

SummarizedExperiment

We will now create a SummarizedExperiment object and further work with it instead of classical dataframe.

For that, we need the columns containing intensities, and experimental design.

Important: by creating SummarizedExperiment, the protein intensities get log2 transformed! No need for a separate log2 transformation.

# grep the intensity columns
intensity_columns <- grep("LFQ.", colnames(data_unique)) 
# more often you will be using Intensity instead of LFQ intensity, so grep("Intensity.", colnames(data_unique))

# create experimental design
exp_design <- data.frame(
  label = c("Ubi4_1", "Ubi4_2", "Ubi4_3", 
            "Ubi6_1", "Ubi6_2", "Ubi6_3", 
            "Ctrl_1", "Ctrl_2", "Ctrl_3", 
            "Ubi1_1", "Ubi1_2", "Ubi1_3"),
  condition = c("Ubi4", "Ubi4", "Ubi4", 
                "Ubi6", "Ubi6", "Ubi6", 
                "Ctrl", "Ctrl", "Ctrl", 
                "Ubi1", "Ubi1", "Ubi1"),
  replicate = c(rep(1:3, times = 4))
)

# define the variable types
exp_design$label <- as.character(exp_design$label)
exp_design$condition <- as.character(exp_design$condition)
exp_design$replicate <- as.numeric(exp_design$replicate)

data_se <- make_se(data_unique, intensity_columns, exp_design)

SummarizedExperiment intermezzo

# plot identifications per each condition
data_se
## class: SummarizedExperiment 
## dim: 2910 12 
## metadata(0):
## assays(1): ''
## rownames(2910): RBM47 UBA6 ... ATXN2.3 X6RHB9
## rowData names(13): Protein.IDs Majority.protein.IDs ... name ID
## colnames(12): Ubi4_1 Ubi4_2 ... Ubi1_2 Ubi1_3
## colData names(4): label ID condition replicate
dim(data_se)
## [1] 2910   12
#dimnames(data_se)
#colData(data_se)
#assay(data_se)
head(assay(data_se))
##          Ubi4_1   Ubi4_2   Ubi4_3   Ubi6_1   Ubi6_2   Ubi6_3   Ctrl_1   Ctrl_2
## RBM47  25.09293 24.55807 24.83147 24.75496 24.47172 24.96301 24.62634 24.89851
## UBA6         NA       NA       NA       NA       NA       NA 22.79170 23.03797
## ILVBL        NA       NA       NA       NA       NA       NA 24.83388 23.92488
## ACO2   24.63423 24.34157 24.55086 23.97132 24.53600 24.67417 24.62673 24.30529
## RPRD1B       NA       NA       NA       NA       NA       NA       NA       NA
## AAR2         NA       NA       NA       NA       NA       NA       NA       NA
##          Ctrl_3   Ubi1_1   Ubi1_2   Ubi1_3
## RBM47  24.45989 24.72151 24.94566 24.89027
## UBA6   22.89800 23.24012 23.56232 23.10530
## ILVBL  24.53979 24.66942 24.40756 24.67061
## ACO2   24.27831 24.68212 24.40139 23.78087
## RPRD1B       NA       NA 21.48871       NA
## AAR2         NA       NA 21.75776       NA
#data_se$condition
#data_se@assays@data@listData

Number of proteins

# plot identifications per each condition
plot_numbers(data_se)

Data filtering

Proteomics data contain many missing values, therefore we pre-filter the dataset to get rid of the most unreliable identifications.

Firstly, we plot a barplot of the protein identification overlap between samples:

plot_frequency(data_se)

Based on that we can choose the filtering threshold using either of two functions:

  • filter_missval(se, thr = 0)
  • filter_proteins(se, type = c(“complete”, “condition”, “fraction”), thr = NULL, min = NULL)

Thr means a threshold of allowed missing values in at least one condition. So, thr=1 means one missing value is allowed in at least one condition (2/3 replicates of >=1 condition)

data_filt <- filter_missval(data_se, thr = 1)
plot_numbers(data_filt)

plot_frequency(data_filt)

Normalization

We need to remove the technical variability using one of normalization approaches. DEP by default uses variance stabilizing normalization (vsn). We will try several more from the limma package:

Vsn normalization

data_norm_vsn <- normalize_vsn(data_filt)
meanSdPlot(data_norm_vsn)

plot_normalization(data_filt, data_norm_vsn)

Quantile normalization

data_norm_quant <- data_filt
assay(data_norm_quant)<-limma::normalizeQuantiles(assay(data_norm_quant))
meanSdPlot(data_norm_quant)

plot_normalization(data_filt, data_norm_quant)

LoessF normalization

data_norm_loess <- data_filt
assay(data_norm_loess)<-limma::normalizeCyclicLoess(assay(data_norm_loess))
meanSdPlot(data_norm_loess)

plot_normalization(data_filt, data_norm_loess)

Median normalization

data_norm_med <- data_filt
assay(data_norm_med)<-limma::normalizeMedianValues(assay(data_norm_med))
meanSdPlot(data_norm_med)

plot_normalization(data_filt, data_norm_med)

We will continue for now with the default, vsn normalization.

Imputation of NAs

Proteomic data contain high number of missing values. There are three different types of missing values:

  • MCAR + MAR: uniform occurence across data
  • MNAR: proteins below detection limit, low abundance proteins

Based on the type of missingness, we choose appropriate imputation algorithm:

For MNAR:

  • global minimum: imputes data by smallest non-missing value in the dataset
  • QRILC: imputation using quantile regression-based left-censored function
  • MinDet: imputation using minimal value, but sample-wise
  • MinProb: imputation by random draws from a Gaussian distribution centered to a minimal value (q-th quantile)
  • man: imputation by random draws from a manually defined left-shifted Gaussian distribution

For MCAR:

  • kNN: imputation by k-Nearest Neighbors averaging
  • MLE: Maximum likelihood-based imputation method

For both, MNAR+MCAR: - mixed imputation or use different package, e.g. imp4p

Plot heatmap of missing proteins in at least 1 condition (white = NA; black = present)

plot_missval(data_filt)

Plot intensity distributions and cumulative fraction of proteins with and without missing values

plot_detect(data_filt)

Impute the missing values

data_imp <- impute(data_norm_vsn, fun = "man", shift = 1.8, scale = 0.3)
plot_imputation(data_norm_vsn, data_imp)

Plot PCA for 500 most variable proteins

plot_pca(data_imp, x = 1, y = 2, n = 500, point_size = 4)

Differential expression

We want to get differentially expressed proteins now, between the bait and control. For that, we will be using limma test.

# Test every sample versus control
data_diff <- test_diff(data_imp, type = "control", control = "Ctrl")

# Or we can manually define contrasts as well:
# Test manually defined comparisons
#data_diff_manual <- test_diff(data_imp, type = "manual", 
#                              test = c("Ubi4_vs_Ctrl", "Ubi6_vs_Ctrl"))

Now, we need to set cutoffs for calling a protein differentially expressed: - logFC -> 1 - adjusted p-value -> 0.05

So, proteins with (logFC > 1 & adj.pvalue < 0.05) will be upregulated, and proteins with (logFC < -1 & adj.pvalue < 0.05) will be downregulated.

We also need to correct for multiple testing, which in DEP is done by fdrtools.

# Denote significant proteins based on user defined cutoffs
dep <- add_rejections(data_diff, alpha = 0.05, lfc = log2(1))

Data visualization

Now let’s see how our differentially expressed proteins look like:

Correlation matrix

# Plot the Pearson correlation matrix
plot_cor(dep, significant = TRUE, lower = 0, upper = 1, pal = "Reds")

Heatmap of significant proteins

# Plot a heatmap of all significant proteins with the data centered per protein
plot_heatmap(dep, type = "centered", kmeans = TRUE, 
             k = 6, col_limit = 4, show_row_names = FALSE,
             indicate = c("condition", "replicate"))

Heatmap of contrasts

# Plot a heatmap of all significant proteins (rows) and the tested contrasts (columns)
plot_heatmap(dep, type = "contrast", kmeans = TRUE, 
             k = 6, col_limit = 10, show_row_names = FALSE)

Volcano plots

Ubi1 vs Ctrl

plot_volcano(dep, contrast = "Ubi1_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)

Ubi4 vs Ctrl

plot_volcano(dep, contrast = "Ubi4_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)

Ubi6 vs Ctrl

plot_volcano(dep, contrast = "Ubi6_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)

Barplots of single proteins

We can also plot how proteins of our interest behave across conditions:

plot_single(dep, proteins = "USP15")

plot_single(dep, proteins = "USP15", type = "centered")

Plot_all()

Several types of plots can be also generated at once, using the plot_all command:

#plot_all(dep, plots = c("volcano", "heatmap", "single", "freq", "comparison"))

Results table

Now, we will generate table with the results of differential expression:

data_results <- get_results(dep)

# We can also get data in wide or long format:
# df_wide <- get_df_wide(dep)
# df_long <- get_df_long(dep)

For better work, we can also generate interactive table:

data_results %>%
  select(ID, name, ends_with(c('_p.val', '_p.adj', '_ratio'))) %>%
    datatable(extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                           lengthMenu = list(c(10,25,50,-1),
                                             c(10,25,50,"All"))))

We can also save the workspace for further use:

save(data_se, data_norm_vsn, data_imp, data_diff, dep, file = here('outputs', "data.RData"))
# These data can be loaded in future R sessions using this command
# load("data.RData")

SessionInfo()

Last, for the reproducibility, we should also save information about which packages we used using the SessionInfo() command:

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Czech_Czechia.1250  LC_CTYPE=Czech_Czechia.1250   
## [3] LC_MONETARY=Czech_Czechia.1250 LC_NUMERIC=C                  
## [5] LC_TIME=Czech_Czechia.1250    
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [3] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
##  [5] IRanges_2.28.0              S4Vectors_0.32.4           
##  [7] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
##  [9] matrixStats_0.62.0          DT_0.22                    
## [11] dplyr_1.0.8                 DEP_1.16.0                 
## [13] here_1.0.1                 
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-3       rjson_0.2.21           ellipsis_0.3.2        
##   [4] rprojroot_2.0.3        circlize_0.4.14        XVector_0.34.0        
##   [7] GlobalOptions_0.1.2    clue_0.3-60            rstudioapi_0.13       
##  [10] hexbin_1.28.2          farver_2.1.0           mzR_2.28.0            
##  [13] affyio_1.64.0          ggrepel_0.9.1          fansi_1.0.3           
##  [16] mvtnorm_1.1-3          codetools_0.2-18       ncdf4_1.19            
##  [19] doParallel_1.0.17      impute_1.68.0          knitr_1.39            
##  [22] jsonlite_1.8.0         cluster_2.1.3          vsn_3.62.0            
##  [25] png_0.1-7              shinydashboard_0.7.2   shiny_1.7.1           
##  [28] BiocManager_1.30.16    readr_2.1.2            compiler_4.1.2        
##  [31] assertthat_0.2.1       Matrix_1.4-1           fastmap_1.1.0         
##  [34] gmm_1.6-6              limma_3.50.3           cli_3.2.0             
##  [37] later_1.3.0            htmltools_0.5.2        tools_4.1.2           
##  [40] gtable_0.3.0           glue_1.6.2             GenomeInfoDbData_1.2.7
##  [43] affy_1.72.0            Rcpp_1.0.8.3           MALDIquant_1.21       
##  [46] jquerylib_0.1.4        vctrs_0.4.1            preprocessCore_1.56.0 
##  [49] crosstalk_1.2.0        iterators_1.0.14       tmvtnorm_1.5          
##  [52] xfun_0.30              stringr_1.4.0          mime_0.12             
##  [55] lifecycle_1.0.1        XML_3.99-0.9           zlibbioc_1.40.0       
##  [58] MASS_7.3-57            zoo_1.8-10             scales_1.2.0          
##  [61] MSnbase_2.20.1         pcaMethods_1.86.0      hms_1.1.1             
##  [64] promises_1.2.0.1       ProtGenerics_1.26.0    parallel_4.1.2        
##  [67] sandwich_3.0-1         RColorBrewer_1.1-3     ComplexHeatmap_2.10.0 
##  [70] yaml_2.3.5             gridExtra_2.3          ggplot2_3.3.5         
##  [73] sass_0.4.1             stringi_1.7.6          highr_0.9             
##  [76] foreach_1.5.2          BiocParallel_1.28.3    shape_1.4.6           
##  [79] rlang_1.0.2            pkgconfig_2.0.3        bitops_1.0-7          
##  [82] imputeLCMD_2.0         mzID_1.32.0            evaluate_0.15         
##  [85] lattice_0.20-45        purrr_0.3.4            labeling_0.4.2        
##  [88] htmlwidgets_1.5.4      tidyselect_1.1.2       norm_1.0-10.0         
##  [91] plyr_1.8.7             magrittr_2.0.3         R6_2.5.1              
##  [94] generics_0.1.2         DelayedArray_0.20.0    DBI_1.1.2             
##  [97] pillar_1.7.0           MsCoreUtils_1.6.2      RCurl_1.98-1.6        
## [100] tibble_3.1.6           crayon_1.5.1           fdrtool_1.2.17        
## [103] utf8_1.2.2             tzdb_0.3.0             rmarkdown_2.14        
## [106] GetoptLong_1.0.5       grid_4.1.2             digest_0.6.29         
## [109] xtable_1.8-4           tidyr_1.2.0            httpuv_1.6.5          
## [112] munsell_0.5.0          bslib_0.3.1